Design Consideration for Multi-lingual Cascading Text Compressors
Abstract
In this paper, we study the cascading of LZ variants with Huffman coding for multilingual documents. Two models are proposed: a static model and an adaptive (dynamic) model. The static model uses the dictionary generated by the LZW algorithm in Chinese dictionary-based Huffman compression to achieve better performance. The dynamic model is an extension of the static cascading model: as phrases are inserted into the dictionary, their frequency counts are updated, so that a dynamic Huffman tree with variable-length output tokens is obtained.
The static cascading of LZW and Huffman coding can be described as a two-pass compression model. In the first pass, the “LZW dictionary” is generated by capturing all the dictionary entries created during LZW compression. However, this also picks up many entries that are never used during compression, so the final “LZW dictionary” becomes very large. This increases the header overhead and results in transmitting a lot of unnecessary information. To avoid this loss, we propose a new method that captures the “LZW dictionary” by picking up the dictionary entries during decompression. The general idea is to add delimiters during decompression so that the decompressed file is segmented into phrases that reflect how the LZW compressor used its dictionary phrases to encode the source.
The adaptive cascading model can be thought of as an extension of Chinese LZW compression. Since the size of the header is an important performance bottleneck in the static cascading model, we propose the adaptive cascading model to address this issue. The LZW compressor now outputs not a fixed-length token but a variable-length Huffman code taken from the Huffman tree. Such a compressor is expected to achieve very good compression performance.
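The static two-pass model described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`lzw_phrases`, `huffman_code`, `static_cascade`, `decode`) are hypothetical, the phrase boundaries are recorded at the points where the LZW parser emits a token (assumed here to give the same segmentation as inserting delimiters during decompression), and the header is modeled simply as the Huffman table over the phrases that were actually used.

```python
import heapq
from collections import Counter
from itertools import count


def lzw_phrases(text):
    """First pass: return the phrases an LZW parser emits, in order.

    Only phrases that are actually emitted appear in the result, so
    dictionary entries that were created but never used are left out.
    """
    dictionary = set(text)  # seed with every single symbol in the input
    phrases, current = [], ""
    for ch in text:
        if current + ch in dictionary:
            current += ch                 # keep extending the match
        else:
            phrases.append(current)       # emit the longest known phrase
            dictionary.add(current + ch)  # learn the new, longer phrase
            current = ch
    if current:
        phrases.append(current)
    return phrases


def huffman_code(freqs):
    """Build a prefix-free code (phrase -> bit string) from frequencies."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tie = count()  # unique tie-breaker so heap entries never compare dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def static_cascade(text):
    """Second pass: Huffman-code the sequence of used LZW phrases."""
    phrases = lzw_phrases(text)
    code = huffman_code(Counter(phrases))
    bits = "".join(code[p] for p in phrases)
    return code, bits  # `code` stands in for the transmitted header


def decode(code, bits):
    """Invert static_cascade using the phrase table from the header."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:  # valid because the code is prefix-free
            out.append(inverse[buf])
            buf = ""
    return "".join(out)
```

In this sketch, calling `static_cascade(text)` returns the phrase table and the encoded bit string, and `decode(code, bits)` reconstructs the input. The size of `code` is what the static model pays as header overhead, which is the cost the adaptive model is meant to avoid.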
In our adaptive cascading model, we choose LZW instead of LZSS because the LZW algorithm preserves more information than the LZSS algorithm does. This characteristic is found to be very useful in helping Chinese compressors attain better performance, because repeated Chinese phrases can be found only if a large amount of previous content is available. On our multilingual corpus, our adaptive cascading scheme performs better than the well-known cascading compressor gzip by an average of about 20%. This work was supported in part by the Advanced Research Grants RP960686 and RP970630 of the National University of Singapore.